Data Visualisation 3

Visualising summary statistics

Daniela Palleschi

Humboldt-Universität zu Berlin

2023-12-18

Learning objectives

Today we will learn to…

  • produce and interpret boxplots
  • visualise means and standard deviations

Resources

Set-up

Packages

pacman::p_load(tidyverse,
               here,
               janitor,
               ggthemes,
               patchwork)

Data

df_eng <- read_csv(
  here(
    "daten",
    "languageR_english.csv"
  )
) |> 
  clean_names() |> 
  rename(
    rt_lexdec = r_tlexdec,
    rt_naming = r_tnaming
  )

Review: Visualising distributions

  • look at each plot in Figure 1
    • how many variables are visualised in each?
    • what types of variables are they?
    • what summary statistic(s) is/are represented in each plot?
Figure 1: Different plot types

Representing summary statistics

  • the mode and range are visualised in the histogram and density plots
  • the number of observations is visualised in barplots

Boxplot

  • a.k.a. box-and-whisker plots, contain
    • a box
    • a line in the middle of the box
    • lines sticking out of either end of the box (the ‘whiskers’)
    • sometimes dots

  • look at Figure 2
    • identify each of these 4 aspects of the plot
    • can you guess what each of these might represent, and how you should interpret the plot?

Figure 2: Boxplot of df_eng (rt_lexdec by age_subject)

  • boxplots communicate a lot of information in a single visualisation
    • the box itself represents the interquartile range (IQR; the range within which the middle 50% of the data lie)
      • the boundaries of the box represent Q1 (1st quartile, below which 25% of the data lie) and Q3 (3rd quartile, above which 25% of the data lie)
    • the line in the middle of the box represents the median
      • also called Q2 (2nd quartile; the middle value above/below which 50% of the data lie)
    • the whiskers extend to the most extreme observations that lie within 1.5*IQR below Q1 (lower whisker) or above Q3 (upper whisker)
    • any dots that lie beyond the whiskers represent outliers (i.e., extreme values that are more than 1.5*IQR below Q1 or above Q3)
  • Figure 3 shows the relationship between a histogram and a boxplot

Figure 3: Image source: Winter (2019) (all rights reserved)

  • Figure 4 has a similar comparison, including a scatterplot

Figure 4: Image source: Wickham et al. (2023) (all rights reserved)
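
The boxplot components described above can also be computed by hand. The following is a minimal sketch using base R only; the vector x holds made-up values (not taken from df_eng), and quantile() is used with its default settings, so results may differ slightly from other quartile definitions.

```r
# Toy data (hypothetical values, not from df_eng)
x <- c(2, 4, 4, 5, 6, 7, 8, 9, 10, 30)

# Q1, median (Q2) and Q3: the lower edge, middle line and upper edge of the box
q <- quantile(x, probs = c(.25, .5, .75))

# Interquartile range: the height of the box (Q3 - Q1)
iqr <- IQR(x)

# Fences: 1.5 * IQR below Q1 and above Q3
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# The whiskers end at the most extreme observations inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Anything beyond the fences would be drawn as a dot (outlier)
outliers <- x[x < lower_fence | x > upper_fence]

q         # 4.25, 6.50, 8.75
iqr       # 4.5
outliers  # 30
```

Here the upper whisker stops at 10 (the largest value below the upper fence of 15.5), and the value 30 would be plotted as a single dot.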

geom_boxplot()

  • the geom_boxplot() function from ggplot2 produces boxplots
    • it needs a numerical variable mapped to the x or y axis (Figure 5)
df_eng |> 
  ggplot(aes(y = rt_lexdec)) +
  geom_boxplot() 

Figure 5: A boxplot for all observations of a continuous variable

  • for boxplots of different groups: map a categorical variable to the other axis (Figure 6)
df_eng |> 
  ggplot(aes(x = age_subject, y = rt_lexdec)) +
  geom_boxplot() +
  theme_bw()

Figure 6: A boxplot for two groups

Grouped boxplot

  • we can produce grouped boxplots to visualise more variables
    • just map a new variable to the colour or fill aesthetic
df_eng |> 
  ggplot(aes(x = age_subject, y = rt_lexdec, colour = word_category)) +
  geom_boxplot() +
  labs(
    x = "Age group",
    y = "LDT reaction time (ms)",
    color = "Word type"
  ) +
  scale_colour_colorblind() +
  theme_bw()
A grouped boxplot

Visualising the mean

  • we typically also want to plot the mean with standard deviation
    • How might we do this?

Errorbar plots

  • these plots have 2 parts:
    • the mean, visualised with geom_point()
    • some measure of dispersion visualised with geom_errorbar()
  • for this course we’ll use the standard deviation
  • Figure 7 is what we’ll produce today
Figure 7: Errorbar plot of df_eng (mean rt_lexdec by age_subject)

Computing summary statistics

  • we need to first calculate the mean and standard deviation
    • grouped by whatever variables we want to visualise
  • how can we compute the mean and sd of rt_lexdec by age_subject?
sum_eng <- df_eng |> 
  summarise(mean = mean(rt_lexdec),
            sd = sd(rt_lexdec),
            N = n(),
            .by = age_subject) |> 
  arrange(age_subject)
  • we can then feed this summary into ggplot() with the appropriate aesthetic mapping and geoms

Plotting mean

  • let’s first plot the means using geom_point()
sum_eng |> 
  ggplot() +
  aes(x = age_subject, y = mean) +
  geom_point()

Adding errorbars

  • now let’s add our errorbars representing 1 standard deviation above and below the mean
  • we do this with geom_errorbar()
    • takes ymin and ymax as its arguments
    • for us, these will be mean - sd and mean + sd, respectively
sum_eng |> 
  ggplot() +
  aes(x = age_subject, y = mean) +
  geom_point() +
  geom_errorbar(aes(ymin = mean-sd, 
                    ymax = mean+sd))

  • if we add some further customisations, we get a more polished version of the same plot
sum_eng |> 
  ggplot(aes(x = age_subject, y = mean, colour = age_subject, shape = age_subject)) +
  # geom_point(data = df_eng, alpha = .4, position = position_jitterdodge(.5), aes(x = age_subject, y = rt_lexdec)) +
  geom_point(size = 3) +
  geom_errorbar(width = .5, aes(ymin=mean-sd, ymax=mean+sd)) +
  labs(title = "Mean LDT times (+/-1SD)",
    x = "Age group",
    y = "Reaction time (ms)",
    color = "Age group"
  ) +
  scale_color_colorblind() +
  theme_bw() +
  theme(
    legend.position = "none"
  )

Barplot of means: stay away!

  • you will very often see barplots of mean values
    • but there are lots of reasons why this is a bad idea!!
  • the barplot has a terrible data-ink ratio, i.e., the amount of data-ink divided by the total ink required to produce the graphic
    • What if there are very few or no observations near zero? We’re using a lot of ink where there aren’t any observations!
    • also, the bar only covers the space where the bottom half of the observations lie; just as many observations lie above the mean!
  • errorbars alone are not the answer: this also hides a lot of information
    • it’s a good reason to always visualise your raw datapoints regardless of what summary plot you produce
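
One way to show the raw datapoints together with the summary is to draw them as a faint, jittered layer underneath the means and errorbars. The sketch below assumes made-up data: df_toy and its values are hypothetical stand-ins for df_eng, and the jitter width and transparency are arbitrary choices.

```r
library(dplyr)
library(ggplot2)

set.seed(1) # make the simulated toy data reproducible

# Hypothetical stand-in for df_eng: 50 made-up reaction times per age group
df_toy <- tibble(
  age_subject = rep(c("old", "young"), each = 50),
  rt_lexdec = c(rnorm(50, mean = 700, sd = 80),
                rnorm(50, mean = 600, sd = 70))
)

# Mean and standard deviation per group, as computed for sum_eng above
sum_toy <- df_toy |>
  summarise(mean = mean(rt_lexdec),
            sd = sd(rt_lexdec),
            .by = age_subject)

fig_raw <- ggplot() +
  # raw observations: jittered horizontally only, semi-transparent
  geom_jitter(data = df_toy,
              aes(x = age_subject, y = rt_lexdec),
              width = .2, height = 0, alpha = .3) +
  # group means
  geom_point(data = sum_toy,
             aes(x = age_subject, y = mean),
             size = 3) +
  # 1 standard deviation above and below the mean
  geom_errorbar(data = sum_toy,
                aes(x = age_subject, ymin = mean - sd, ymax = mean + sd),
                width = .2) +
  theme_bw()

fig_raw
```

Setting height = 0 in geom_jitter() keeps the reaction times themselves unchanged; only the horizontal position is spread out so that overlapping points become visible.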

Learning objectives 🏁

In this section we learned how to…

  • produce and interpret boxplots ✅
  • produce and interpret errorbar plots ✅

Homework

Boxplot with facet

  1. Produce a plot called fig_boxplot, which is a boxplot of the df_eng data, with:
    • age_subject plotted on the x axis
    • rt_naming on the y-axis
    • age_subject as colour or fill (choose one, there’s no wrong choice)
    • word_category plotted in two facets using facet_wrap()
    • whichever theme_ setting you choose (e.g., theme_bw(); for more options see here)

Errorbar plot

  1. Try to reproduce Figure 8. Hint: you will use the rt_naming variable from df_eng.

Figure 8: Plot to be reproduced

Patchwork

  1. Using the patchwork package, plot your boxplot and your errorbar plot side by side. It should look something like Figure 9. Hint: if you want to add the “tag levels” (“A” and “B”) to the plots, you need to add + plot_annotation(tag_levels = "A") from patchwork.

Figure 9: Combined plots with patchwork
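
As a generic illustration of how patchwork combines plots, here is a minimal sketch using the mpg dataset that ships with ggplot2 rather than df_eng. The object names p_box, p_point and fig_combined are made up for this example.

```r
library(ggplot2)
library(patchwork)

# Two unrelated example plots built from ggplot2's built-in mpg data
p_box <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot() +
  theme_bw()

p_point <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  theme_bw()

# `+` places the plots side by side; plot_annotation() adds the tags "A" and "B"
fig_combined <- p_box + p_point +
  plot_annotation(tag_levels = "A")

fig_combined
```

The same pattern should carry over to any two ggplot objects, including the ones you produce for this homework.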

Session Info

Produced with R version 4.3.0 (2023-04-21) (Already Tomorrow) and RStudio version 2023.9.0.463 (Desert Sunflower).

print(sessionInfo(), locale = FALSE)
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] magick_2.7.4    patchwork_1.1.3 ggthemes_4.2.4  janitor_2.2.0  
 [5] here_1.0.1      lubridate_1.9.2 forcats_1.0.0   stringr_1.5.0  
 [9] dplyr_1.1.3     purrr_1.0.2     readr_2.1.4     tidyr_1.3.0    
[13] tibble_3.2.1    ggplot2_3.4.3   tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] utf8_1.2.3       generics_0.1.3   stringi_1.7.12   hms_1.1.3       
 [5] digest_0.6.33    magrittr_2.0.3   evaluate_0.21    grid_4.3.0      
 [9] timechange_0.2.0 fastmap_1.1.1    rprojroot_2.0.3  jsonlite_1.8.7  
[13] fansi_1.0.4      scales_1.2.1     cli_3.6.1        crayon_1.5.2    
[17] rlang_1.1.1      bit64_4.0.5      munsell_0.5.0    withr_2.5.0     
[21] yaml_2.3.7       parallel_4.3.0   tools_4.3.0      tzdb_0.4.0      
[25] colorspace_2.1-0 pacman_0.5.1     png_0.1-8        vctrs_0.6.3     
[29] R6_2.5.1         lifecycle_1.0.3  snakecase_0.11.0 bit_4.0.5       
[33] vroom_1.6.3      pkgconfig_2.0.3  pillar_1.9.0     gtable_0.3.4    
[37] glue_1.6.2       Rcpp_1.0.11      xfun_0.39        tidyselect_1.2.0
[41] rstudioapi_0.14  knitr_1.44       farver_2.1.1     htmltools_0.5.5 
[45] labeling_0.4.3   rmarkdown_2.22   compiler_4.3.0  

References

Nordmann, E., McAleer, P., Toivo, W., Paterson, H., & DeBruine, L. M. (2022). Data Visualization Using R for Researchers Who Do Not Use R. Advances in Methods and Practices in Psychological Science, 5(2), 251524592210746. https://doi.org/10.1177/25152459221074654
Wickham, H., Çetinkaya-Rundel, M., & Grolemund, G. (2023). R for Data Science (2nd ed.).
Winter, B. (2019). Statistics for Linguists: An Introduction Using R. Routledge. https://doi.org/10.4324/9781315165547